Prosper Loan Data Exploration

by Vilija Vaitkeviciute

Preliminary Wrangling

The loan data from Prosper, an online lending platform that connects people who want to borrow money with individuals and institutions, is explored here. The loan data includes loan characteristics, Prosper's internal tracking data, borrower profile and some lender information.

The analysis will focus on

  • understanding how the lender's estimated return relates to various factors, and
  • relationship between expected losses and actual losses.

The first question is of interest for potential lenders, who may not be familiar with loan pricing mechanism, or even for auditors and competitors, while the second question is a proxy for how a return estimate relates to actual returns. The latter is a proxy because losses are one part of return (the other part being yield and other fees).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

sns.set(style="white")
In [2]:
# load dataset
df = pd.read_csv('prosperLoanData.csv')
df.head()
Out[2]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 ... -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 ... -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 ... -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 ... -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20

5 rows × 81 columns

In [3]:
df.shape
Out[3]:
(113937, 81)
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
ListingKey                             113937 non-null object
ListingNumber                          113937 non-null int64
ListingCreationDate                    113937 non-null object
CreditGrade                            28953 non-null object
Term                                   113937 non-null int64
LoanStatus                             113937 non-null object
ClosedDate                             55089 non-null object
BorrowerAPR                            113912 non-null float64
BorrowerRate                           113937 non-null float64
LenderYield                            113937 non-null float64
EstimatedEffectiveYield                84853 non-null float64
EstimatedLoss                          84853 non-null float64
EstimatedReturn                        84853 non-null float64
ProsperRating (numeric)                84853 non-null float64
ProsperRating (Alpha)                  84853 non-null object
ProsperScore                           84853 non-null float64
ListingCategory (numeric)              113937 non-null int64
BorrowerState                          108422 non-null object
Occupation                             110349 non-null object
EmploymentStatus                       111682 non-null object
EmploymentStatusDuration               106312 non-null float64
IsBorrowerHomeowner                    113937 non-null bool
CurrentlyInGroup                       113937 non-null bool
GroupKey                               13341 non-null object
DateCreditPulled                       113937 non-null object
CreditScoreRangeLower                  113346 non-null float64
CreditScoreRangeUpper                  113346 non-null float64
FirstRecordedCreditLine                113240 non-null object
CurrentCreditLines                     106333 non-null float64
OpenCreditLines                        106333 non-null float64
TotalCreditLinespast7years             113240 non-null float64
OpenRevolvingAccounts                  113937 non-null int64
OpenRevolvingMonthlyPayment            113937 non-null float64
InquiriesLast6Months                   113240 non-null float64
TotalInquiries                         112778 non-null float64
CurrentDelinquencies                   113240 non-null float64
AmountDelinquent                       106315 non-null float64
DelinquenciesLast7Years                112947 non-null float64
PublicRecordsLast10Years               113240 non-null float64
PublicRecordsLast12Months              106333 non-null float64
RevolvingCreditBalance                 106333 non-null float64
BankcardUtilization                    106333 non-null float64
AvailableBankcardCredit                106393 non-null float64
TotalTrades                            106393 non-null float64
TradesNeverDelinquent (percentage)     106393 non-null float64
TradesOpenedLast6Months                106393 non-null float64
DebtToIncomeRatio                      105383 non-null float64
IncomeRange                            113937 non-null object
IncomeVerifiable                       113937 non-null bool
StatedMonthlyIncome                    113937 non-null float64
LoanKey                                113937 non-null object
TotalProsperLoans                      22085 non-null float64
TotalProsperPaymentsBilled             22085 non-null float64
OnTimeProsperPayments                  22085 non-null float64
ProsperPaymentsLessThanOneMonthLate    22085 non-null float64
ProsperPaymentsOneMonthPlusLate        22085 non-null float64
ProsperPrincipalBorrowed               22085 non-null float64
ProsperPrincipalOutstanding            22085 non-null float64
ScorexChangeAtTimeOfListing            18928 non-null float64
LoanCurrentDaysDelinquent              113937 non-null int64
LoanFirstDefaultedCycleNumber          16952 non-null float64
LoanMonthsSinceOrigination             113937 non-null int64
LoanNumber                             113937 non-null int64
LoanOriginalAmount                     113937 non-null int64
LoanOriginationDate                    113937 non-null object
LoanOriginationQuarter                 113937 non-null object
MemberKey                              113937 non-null object
MonthlyLoanPayment                     113937 non-null float64
LP_CustomerPayments                    113937 non-null float64
LP_CustomerPrincipalPayments           113937 non-null float64
LP_InterestandFees                     113937 non-null float64
LP_ServiceFees                         113937 non-null float64
LP_CollectionFees                      113937 non-null float64
LP_GrossPrincipalLoss                  113937 non-null float64
LP_NetPrincipalLoss                    113937 non-null float64
LP_NonPrincipalRecoverypayments        113937 non-null float64
PercentFunded                          113937 non-null float64
Recommendations                        113937 non-null int64
InvestmentFromFriendsCount             113937 non-null int64
InvestmentFromFriendsAmount            113937 non-null float64
Investors                              113937 non-null int64
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB
In [5]:
#converting all dates to datetime
df.ListingCreationDate = pd.to_datetime(df.ListingCreationDate)
df.LoanOriginationDate = pd.to_datetime(df.LoanOriginationDate)
df.DateCreditPulled = pd.to_datetime(df.DateCreditPulled)
df.ClosedDate = pd.to_datetime(df.ClosedDate)
In [6]:
# LoanStatus into ordered categorical type
loanstatus_order = ['Current', 'FinalPaymentInProgress', 'Completed',  'Past Due (1-15 days)',
                    'Past Due (16-30 days)', 'Past Due (31-60 days)', 'Past Due (61-90 days)', 
                    'Past Due (91-120 days)', 'Past Due (>120 days)', 'Defaulted',
                    'Chargedoff']

ordered_var = pd.api.types.CategoricalDtype(ordered = True, categories = loanstatus_order)
df.LoanStatus = df.LoanStatus.astype(ordered_var)

Variable dictionary mentions a change in loan data applicable for loans originated after July 2009; this includes calculations of Effective Yield, Estimated Loss and Effective Return, key variables analysed. Therefore, I will drop the loan listings that occure before July 2009. Before doing this, I will check the loan listing distribution accross time, to ensure there will be sufficient remaining data for the proposed analysis.

In [7]:
plt.figure(figsize=[15,5])
df.groupby([df.ListingCreationDate.dt.year,df.ListingCreationDate.dt.month])['ListingCreationDate'].count().plot(kind='bar')
plt.xlabel('Listing creation period (year, month)');

There appears to be sufficient data for analysis for post July 2009 period. Will drop earlier observations (in line when first EstimatedReturn is recorded).

In [8]:
min_date = min(df[~df.EstimatedReturn.isnull()].ListingCreationDate)
min_date
Out[8]:
Timestamp('2009-07-13 18:01:24.347000')
In [9]:
df = df[df.ListingCreationDate>= min_date]
df.shape
Out[9]:
(84853, 81)

As the dataset contains 81 variables, I will create a correlation matrix for the numeric variables, in order to make a preliminary decision as to which set of variables are likely to relate to variables of interest.

This will not be conclusive analysis, because, among others:

  • linear correlation is not the only type of relashionship,
  • this excludes any categorical variables,
  • in cases where relationship between variables is expected, but corr. coefficient is low, it is interesting to dig further.
In [10]:
#lets check correlations of all numeric variables
plt.figure(figsize=[20,20])

#compute correlations
corr = df.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Plot heatmap
sns.heatmap(corr, cmap = 'RdBu', vmin = -1, vmax = 1, mask=mask, linewidths=0.5)
plt.title('Correlation table (numerical variables)');

What is the structure of your dataset?

There are 84,853 observations in the dataset (abreviated to include only post 13-July-2009 observations), where 13-July-2009 marks introduction of EstimatedReturn, EstimatedLoss, Estimated Yield, ProsperScore and other key variables to the analysis.

The dataset contains 81 features, of which majority are numerical (61), categorical (17) and boolean (3); some of the numerical variables are numerical categories or ratings (such as 'ProsperRating (numeric)', ListingCategory, ProsperScore).

What is/are the main feature(s) of interest in your dataset?

The analysis will focus on

  • understanding how the lender's estimated return relates to various factors, and
  • relationship between expected losses and actual losses.

The first question is of interest for potential lenders, who may not be familiar with loan pricing mechanism, or even for auditors and competitors, while the second question is a proxy for how a return estimate relates to actual returns. The latter is a proxy because losses are one part of return (the other part being yield and other fees).

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

Variables correlated with EstimatedReturn (on linear basis):

The following are numerical variables with higher absolute correlation coefficients with estimated lender return, categorised:

  • EstimatedReturn components (estimates/actuals) ('EstimatedEffectiveYield', 'EstimatedLoss', 'LP_NetPrincipalLoss'),
  • Borrower's current and historical need/utisation of credit facilities ('InquiriesLast6Months', 'CurrentDelinquencies', 'DelinquenciesLast7Years', 'BankcardUtilization', 'AvailableBankCardCredit'),
  • Borrower's credit score as provided by external agencies ('CreditScoreRangeLower', 'CreditScoreRangeUpper'),
  • Borrower's income profile ('DebtToIncomeRatio', 'StatedMonthlyIncome', 'MonthlyLoanPayment', 'IsBorrowerHomeowner'),
  • Loan's properties ('LoanOriginalAmount', 'Term', 'ListingCategory (numeric)', 'LoanStatus')
Except:

Few variables, that are highly correlated such as :

  • BorrowerRate, BorrowerAPR are excluded, as they are expected to be linear to EstimatedReturns (e.g. in this case, difference is Prosper's margin).
  • LoanNumber, which is "a unique numeric value associated with a loan", and appears to represent internal loan tracking number
  • LP_ServiceFees - various return components, that should be minor and/or constant accross loans (and could be investigated in broader scopeexamination)
  • LP_InterestandFees - applicable only to defaulted loans, and should be related formulaically to loan yield and amount (and could be investigated in broader scope examination)

Further variables

Additional variables (non numerical) to be included in analysis:

  • date variables for time trends analysis, which in any, especially financial data is of importance ('ListingCreationDate', 'ClosedDate', 'LoanOriginationDate')
  • categorical variables, which were not included in correlation analysis ('Occupation')
  • variables that will need to be transformed, e.g because being abslute variables, perhaps could not have produced linear correlation with EstimatedReturns (e.g 'InvestmentFromFriendsAmount' and 'LoanOriginalAmount' can be used to calculate % of loan funded by friends)

Univariate Exploration

EstimatedReturn components (estimates)

In [11]:
columns = 3
fig, ax = plt.subplots(ncols=columns, figsize = [12,3])
variables = ['EstimatedReturn', 'EstimatedEffectiveYield', 'EstimatedLoss']

for j in range(columns):
    var = variables[j]
    ax[j].hist(df[var], bins=50)
    ax[j].set_xlabel('{}'.format(var))
In [12]:
df[df.EstimatedReturn < 0]['ListingKey'].count(), df[df.EstimatedEffectiveYield < 0]['ListingKey'].count()
Out[12]:
(195, 190)
In [13]:
columns = 3
fig, ax = plt.subplots(ncols=columns, figsize = [12,3])
variables = ['EstimatedReturn', 'EstimatedEffectiveYield', 'EstimatedLoss']

for j in range(columns):
    var = variables[j]
    ax[j].hist(df[var][df[var]>0], bins=50)
    ax[j].set_xlabel('{}'.format(var))

Loss (estimated/actual)

In [14]:
print(df.LP_NetPrincipalLoss[df.LP_NetPrincipalLoss == 0].count())
print(df.LP_NetPrincipalLoss[df.LP_NetPrincipalLoss != 0].count())
78670
6183
In [15]:
columns = 2
fig, ax = plt.subplots(ncols=columns, figsize = [6*columns,4])

variables = ['LoanOriginalAmount', 'LP_NetPrincipalLoss']

for j in range(columns):
    var = variables[j]
    ax[j].hist(df[var][df[var]>0], bins=20)
    ax[j].set_xlabel('{}'.format(var));
In [16]:
df['LoanLossRatio'] = df.LP_NetPrincipalLoss/df.LoanOriginalAmount
In [17]:
plt.figure(figsize = [6,4])
plt.hist(df.LoanLossRatio[(df.LoanLossRatio > 0)], bins=100);

Borrower's current and historical need/utisation of credit facilities

In [18]:
columns = 5
fig, ax = plt.subplots(ncols=columns, figsize = [16,3])
variables = ['InquiriesLast6Months', 'CurrentDelinquencies', 'DelinquenciesLast7Years', 
             'BankcardUtilization', 'AvailableBankcardCredit']

for j in range(columns):
    var = variables[j]
    ax[j].hist(df[var], bins=50)
    ax[j].set_xlabel('{}'.format(var))
In [19]:
columns = 5
rows = 2
fig, ax = plt.subplots(nrows=rows, ncols=columns, figsize = [columns*4,rows*3.5])
variables = ['InquiriesLast6Months', 'CurrentDelinquencies', 'DelinquenciesLast7Years', 
             'BankcardUtilization', 'AvailableBankcardCredit']

for i in range(rows):
    for j in range(columns):
        var = variables[j]
        step = max(df[var])/100
        bins = np.arange(0,max(df[var])+step,step)
        if i==1:
            step = np.log10(df[var].max())/50
            bins = 10 ** np.arange(0, np.log10(df[var].max())+step, step)
            ax[i, j].set_xscale('log')
        ax[i,j].hist(df[var][(df[var] > 0)], bins=bins)
        ax[i, j].set_xlabel('{}'.format(var))    

Borrower's credit score as provided by external agencies

In [20]:
columns = 2
fig, ax = plt.subplots(ncols=columns, figsize = [6*columns,4])

variables = ['CreditScoreRangeLower', 'CreditScoreRangeUpper']

for j in range(columns):
    var = variables[j]
    ax[j].hist(df[var], bins=20)
    ax[j].set_xlabel('{}'.format(var))
In [21]:
sns.scatterplot(data=df, x='CreditScoreRangeLower', y='CreditScoreRangeUpper', alpha = 0.5)
plt.title("Credit Score Lower vs Upper Range Barrier")
plt.xlabel("Lower Barrier")
plt.ylabel("Upper Barrier");
In [22]:
df['CreditScoreMid'] = (df.CreditScoreRangeUpper+df.CreditScoreRangeLower)/2

Borrower's income profile

'DebtToIncomeRatio', 'StatedMonthlyIncome', 'MonthlyLoanPayment', 'IsBorrowerHomeowner'

In [23]:
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))

bins = np.arange(-1,max(df.DebtToIncomeRatio),0.5)
ax[0].hist(df.DebtToIncomeRatio, bins=bins)

bins = np.arange(0,2,0.1)
ax[1].hist(df.DebtToIncomeRatio, bins=bins);
In [24]:
plt.figure(figsize = [6,4])
bins = np.arange(-1,max(df.StatedMonthlyIncome),10000)
plt.hist(df.StatedMonthlyIncome, bins=bins);
In [25]:
df.StatedMonthlyIncome[df.StatedMonthlyIncome > 50000].count()
Out[25]:
61
In [26]:
df.StatedMonthlyIncome.describe()
Out[26]:
count    8.485300e+04
mean     5.931175e+03
std      8.239944e+03
min      0.000000e+00
25%      3.434000e+03
50%      5.000000e+03
75%      7.083333e+03
max      1.750003e+06
Name: StatedMonthlyIncome, dtype: float64
In [27]:
bins = np.arange(-1,50000,1000)
plt.hist(df.StatedMonthlyIncome, bins=bins);
In [28]:
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))

ax[0].hist(df.MonthlyLoanPayment, bins=100)

bins = np.arange(0,1500, 50)
ax[1].hist(df.MonthlyLoanPayment, bins=bins)
ax[1].set_xticks(bins)
ax[1].set_xticklabels(bins,rotation = 90);
In [29]:
df.IsBorrowerHomeowner.value_counts().plot(kind='barh', figsize=[6,4]);
In [30]:
df.Occupation.value_counts().head(50).plot(kind='bar', figsize = [12,4]);

Properties of the loan

('ListingCategory (numeric)', 'InvestmentFromFriendsCount')

In [31]:
df.LoanStatus.value_counts().sort_index().plot(kind='bar', figsize = [6,4]);
In [32]:
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))

ax[0].hist(df.LoanOriginalAmount, bins=100)

bins = np.arange(0,35000,1000)
ax[1].hist(df.LoanOriginalAmount, bins=bins);
In [33]:
df.Term.value_counts().sort_index().plot(kind='bar', figsize=[6,4]);
In [34]:
df['ListingCategory (numeric)'].value_counts().sort_index().plot(kind='bar', figsize = [6,4]);
In [35]:
fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))

df['LoanStatus'].value_counts().sort_index().plot(kind='bar', ax=ax[0])
df['LoanStatus'].value_counts().sort_index()[3:].plot(kind='bar', ax=ax[1]);

Date variables

'ListingCreationDate', 'ClosedDate', 'LoanOriginationDate'

In [36]:
fig, ax = plt.subplots(nrows=3, figsize=[12,12])

df.groupby([df.ListingCreationDate.dt.year,df.ListingCreationDate.dt.month])['ListingCreationDate'].count().plot(kind='bar', 
                                                                                                                 ax=ax[0])
ax[0].set_xlabel('Listing creation period (year, month)')

df.groupby([df.LoanOriginationDate.dt.year,df.LoanOriginationDate.dt.month])['LoanOriginationDate'].count().plot(kind='bar', 
                                                                                                                 ax=ax[1])
ax[1].set_xlabel('Loan origination period (year, month)')

df.groupby([df.ClosedDate.dt.year,df.ClosedDate.dt.month])['ClosedDate'].count().plot(kind='bar', 
                                                                                      ax=ax[2])
ax[2].set_xlabel('Listing closed period (year, month)')

plt.subplots_adjust(hspace=0.7);
In [37]:
warehouse_range = (df.LoanOriginationDate - df.ListingCreationDate).dt.days

fig, ax=plt.subplots(ncols=2, figsize=(2*6,4))

ax[0].hist(warehouse_range, bins=100);

step = 0.25
bins = 10**np.arange(0, np.log10(warehouse_range.max())+step, step)
ax[1].hist(warehouse_range, bins=bins)
ax[1].set_xscale('log')
plt.xticks([1,3,10,100,300], [1,3,10,100,300]);
In [38]:
warehouse_range.describe()
Out[38]:
count    84853.000000
mean        10.689133
std         17.708353
min          0.000000
25%          3.000000
50%          7.000000
75%         12.000000
max        529.000000
dtype: float64

Other variables

In [39]:
print(df.InvestmentFromFriendsAmount[df.InvestmentFromFriendsAmount == 0].count())
print(df.InvestmentFromFriendsAmount[df.InvestmentFromFriendsAmount > 0].count())
84267
586
In [40]:
columns = 3
fig, ax = plt.subplots(ncols=columns, figsize = [6*columns,4])

df['FriendInvestmentRatio'] = df.InvestmentFromFriendsAmount / df.LoanOriginalAmount

variables = ['InvestmentFromFriendsAmount', 'LoanOriginalAmount', 'FriendInvestmentRatio']

for j in range(columns):
    var = variables[j]
    ax[j].hist(df[var][df[var]>0], bins=20)
    ax[j].set_xlabel('{}'.format(var))

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Your answer here!

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Your answer here!

Bivariate Exploration

In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).

In [41]:
df = df[['EstimatedReturn', 'EstimatedEffectiveYield',  'EstimatedLoss', 
         'LP_NetPrincipalLoss', 'LoanOriginalAmount', 'LoanLossRatio',
         'InquiriesLast6Months', 'CurrentDelinquencies', 'DelinquenciesLast7Years', 
         'BankcardUtilization', 'AvailableBankcardCredit', 
         'CreditScoreMid',  
         'DebtToIncomeRatio', 'StatedMonthlyIncome', 'MonthlyLoanPayment', 'IsBorrowerHomeowner', 'Occupation',
         'Term', 'ListingCategory (numeric)', 'LoanStatus',
         'ListingCreationDate', 'LoanOriginationDate', 'ClosedDate',
         'InvestmentFromFriendsAmount', 'FriendInvestmentRatio']]
In [42]:
df.columns
Out[42]:
Index(['EstimatedReturn', 'EstimatedEffectiveYield', 'EstimatedLoss',
       'LP_NetPrincipalLoss', 'LoanOriginalAmount', 'LoanLossRatio',
       'InquiriesLast6Months', 'CurrentDelinquencies',
       'DelinquenciesLast7Years', 'BankcardUtilization',
       'AvailableBankcardCredit', 'CreditScoreMid', 'DebtToIncomeRatio',
       'StatedMonthlyIncome', 'MonthlyLoanPayment', 'IsBorrowerHomeowner',
       'Occupation', 'Term', 'ListingCategory (numeric)', 'LoanStatus',
       'ListingCreationDate', 'LoanOriginationDate', 'ClosedDate',
       'InvestmentFromFriendsAmount', 'FriendInvestmentRatio'],
      dtype='object')
In [43]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 84853 entries, 1 to 113936
Data columns (total 25 columns):
EstimatedReturn                84853 non-null float64
EstimatedEffectiveYield        84853 non-null float64
EstimatedLoss                  84853 non-null float64
LP_NetPrincipalLoss            84853 non-null float64
LoanOriginalAmount             84853 non-null int64
LoanLossRatio                  84853 non-null float64
InquiriesLast6Months           84853 non-null float64
CurrentDelinquencies           84853 non-null float64
DelinquenciesLast7Years        84853 non-null float64
BankcardUtilization            84853 non-null float64
AvailableBankcardCredit        84853 non-null float64
CreditScoreMid                 84853 non-null float64
DebtToIncomeRatio              77557 non-null float64
StatedMonthlyIncome            84853 non-null float64
MonthlyLoanPayment             84853 non-null float64
IsBorrowerHomeowner            84853 non-null bool
Occupation                     83520 non-null object
Term                           84853 non-null int64
ListingCategory (numeric)      84853 non-null int64
LoanStatus                     84853 non-null category
ListingCreationDate            84853 non-null datetime64[ns]
LoanOriginationDate            84853 non-null datetime64[ns]
ClosedDate                     26005 non-null datetime64[ns]
InvestmentFromFriendsAmount    84853 non-null float64
FriendInvestmentRatio          84853 non-null float64
dtypes: bool(1), category(1), datetime64[ns](3), float64(16), int64(3), object(1)
memory usage: 15.7+ MB
In [44]:
#lets check correlations of all numeric variables
plt.figure(figsize=[20,20])

#compute correlations
corr = df.corr()

# Plot heatmap
sns.heatmap(corr, cmap = "RdBu", vmin = -1, vmax = 1, linewidths=0.5, annot = True, fmt = '.2f')
plt.title('Correlation table (numerical variables)');
In [45]:
fig, ax = plt.subplots(ncols=2, figsize=[16,5])


sns.regplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedEffectiveYield, 
           fit_reg = True, x_jitter= 0.3, scatter_kws= {'alpha' : 1/50}, ax = ax[0])

sns.regplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedLoss, 
           fit_reg = True, x_jitter= 0.3, scatter_kws= {'alpha' : 1/50}, ax = ax[1]);
In [46]:
fig, ax = plt.subplots(ncols=2, figsize=[16,5])

sns.pointplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedEffectiveYield, ci = 'sd',ax = ax[0])
sns.pointplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedLoss, ci = 'sd', ax = ax[1]);
In [47]:
fig, ax = plt.subplots(ncols=2, figsize=[16,5])

sns.scatterplot(data=df, x='EstimatedReturn', y='EstimatedEffectiveYield', 
                hue = df.ListingCreationDate.dt.year, alpha = 0.5, ax=ax[0], legend='full')
ax[0].legend(loc='upper left')

sns.scatterplot(data=df[(df.EstimatedReturn!=df.EstimatedEffectiveYield)&(df.EstimatedReturn>0)], 
                x='EstimatedReturn', y='EstimatedEffectiveYield', hue = df.ListingCreationDate.dt.year, 
                alpha = 0.5, ax=ax[1], legend='full')
ax[1].legend(loc='upper left');
In [48]:
df[df.EstimatedReturn!=df.EstimatedEffectiveYield].ListingCreationDate.dt.year.value_counts()
Out[48]:
2013    35413
2012    19556
2011    11442
2014    10734
2010      257
Name: ListingCreationDate, dtype: int64
In [49]:
df[df.EstimatedReturn==df.EstimatedEffectiveYield].ListingCreationDate.dt.year.value_counts()
Out[49]:
2010    5273
2009    2178
Name: ListingCreationDate, dtype: int64
In [50]:
fig, ax = plt.subplots(ncols=2, figsize=[16,5])

sns.scatterplot(data=df, x='EstimatedEffectiveYield', y='EstimatedLoss', 
                hue = df.ListingCreationDate.dt.year, alpha = 0.5, ax=ax[0], legend='full')
ax[0].legend(loc='upper left')


sns.scatterplot(data=df[(df.EstimatedReturn!=df.EstimatedEffectiveYield)&(df.EstimatedReturn>0)], 
                x='EstimatedEffectiveYield', y='EstimatedLoss', hue = df.ListingCreationDate.dt.year, 
                alpha = 0.5, ax=ax[1], legend='full')
ax[1].legend(loc='upper left');
In [51]:
df[df.EstimatedReturn!=df.EstimatedEffectiveYield][['EstimatedReturn','EstimatedEffectiveYield','EstimatedLoss']].corr()
Out[51]:
EstimatedReturn EstimatedEffectiveYield EstimatedLoss
EstimatedReturn 1.000000 0.888137 0.718346
EstimatedEffectiveYield 0.888137 1.000000 0.957712
EstimatedLoss 0.718346 0.957712 1.000000
In [52]:
fig, ax = plt.subplots(ncols=2,figsize=[16,5])

del_bins = np.arange(0, 100 + 5, 5)
delinq7yr_bins = pd.cut(df.DelinquenciesLast7Years, del_bins, include_lowest = True)
sns.pointplot(x = delinq7yr_bins, y = df.EstimatedEffectiveYield, ci = 'sd', 
              linestyles='--', ax=ax[0])
ax[0].set_xticklabels(labels=del_bins,rotation = 90)

inq_bins = np.arange(0, 20 + 2, 2)
inquiries_bins = pd.cut(df.InquiriesLast6Months, inq_bins, include_lowest = True)
sns.pointplot(x = inquiries_bins, y = df.EstimatedEffectiveYield, ci = 'sd', 
              linestyles='--', ax=ax[1])
ax[1].set_xticklabels(labels=inq_bins,rotation = 90);
In [53]:
status_list = ['Current', 'Completed', 'Defaulted', 'Chargedoff'] 

df_small = df[df.LoanStatus.isin(status_list)]

fig, ax = plt.subplots(ncols=2,figsize=[18,5])
colors = ['pure blue', 'light grey', 'baby blue', 'light grey', 'light grey', 
          'light grey', 'light grey', 'light grey', 'light grey',
          'lipstick', 'rosa']
palette = sns.xkcd_palette(colors)

del_bins = np.arange(0, 100 + 5, 5)
delinq7yr_bins = pd.cut(df_small.DelinquenciesLast7Years, del_bins, include_lowest = True)
sns.pointplot(x = delinq7yr_bins, y = df_small.EstimatedEffectiveYield, hue=df_small.LoanStatus,
              ci = 'sd', linestyles='', ax=ax[0], dodge=True, palette=palette)
ax[0].set_xticklabels(labels=del_bins,rotation = 90)
ax[0].legend(ncol=3,loc='lower left', fontsize=8)

inq_bins = np.arange(0, 20 + 2, 2)
inquiries_bins = pd.cut(df_small.InquiriesLast6Months, inq_bins, include_lowest = True)
sns.pointplot(x = inquiries_bins, y = df_small.EstimatedEffectiveYield, hue=df_small.LoanStatus, 
              ci = 'sd', linestyles='', ax=ax[1], dodge=True, palette=palette)
ax[1].set_xticklabels(labels=inq_bins,rotation = 90)
ax[1].legend(ncol=3,loc='lower left', fontsize=8);
In [54]:
plt.figure(figsize=[12,5])

df_postitive_yield=df[df.EstimatedEffectiveYield>0]

sns.scatterplot(x=df_postitive_yield.AvailableBankcardCredit, y=df_postitive_yield.EstimatedEffectiveYield, 
                alpha = 0.03, color='steelblue')
plt.xscale('log')
plt.xticks([1,100,1000,10000,30000, 100000,1000000],[1,100,'1k','10k','30k','100k','1m']);
In [55]:
sns.set(style='whitegrid')
plt.figure(figsize=[12,5])

AvailableBankCredit_bins = pd.cut(df_postitive_yield.AvailableBankcardCredit, [100,1000,10000,30000, 100000,1000000], 
                                  include_lowest = True)

sns.violinplot(x=AvailableBankCredit_bins, y=df_postitive_yield.EstimatedEffectiveYield, color='steelblue',
              inner='quartile');
In [56]:
plt.figure(figsize=[12,5])

sns.boxplot(x=df_postitive_yield.CreditScoreMid, y=df_postitive_yield.EstimatedEffectiveYield, color='steelblue');
In [57]:
fit, ax = plt.subplots(ncols=2, figsize=[16,5])

df_temp = df_postitive_yield[(df_postitive_yield.DebtToIncomeRatio>0)&(df_postitive_yield.DebtToIncomeRatio<1)]

sns.scatterplot(x=df_temp.DebtToIncomeRatio, y=df_temp.EstimatedEffectiveYield, 
                alpha = 0.03, color='steelblue', ax=ax[0])

DebtToIncome_bins = pd.cut(df_temp.DebtToIncomeRatio, [0,0.2,0.3,0.4,0.5,0.75,1], 
                                  include_lowest = True)

sns.violinplot(x=DebtToIncome_bins, y=df_temp.EstimatedEffectiveYield, color='steelblue',
              inner='quartile');
In [58]:
fit, ax = plt.subplots(ncols=2, figsize=[16,5])

df_temp = df_postitive_yield[(df_postitive_yield.StatedMonthlyIncome>0)&(df_postitive_yield.StatedMonthlyIncome<40000)]

sns.scatterplot(x=df_temp.StatedMonthlyIncome, y=df_temp.EstimatedEffectiveYield, 
                alpha = 0.03, color='steelblue', ax=ax[0])

StatedMonthlyIncome_bins = pd.cut(df_temp.StatedMonthlyIncome, [0,2500,5000,7500,10000,20000,40000], 
                                  include_lowest = True)

sns.violinplot(x=StatedMonthlyIncome_bins, y=df_temp.EstimatedEffectiveYield, color='steelblue',
              inner='quartile')
plt.xticks(rotation=45);
In [59]:
plt.figure(figsize=[6,5])

sns.boxplot(x=df_postitive_yield.IsBorrowerHomeowner, y=df_postitive_yield.EstimatedEffectiveYield, 
            color='steelblue');
In [60]:
fix, ax=plt.subplots(ncols=2, figsize=[16,5])

occupation_lowyield = df_postitive_yield.groupby('Occupation')['EstimatedEffectiveYield'].mean().sort_values().head(10).index
occupation_highyield = df_postitive_yield.groupby('Occupation')['EstimatedEffectiveYield'].mean().sort_values().tail(10).index

df_temp_low = df_postitive_yield[df_postitive_yield.Occupation.isin(occupation_lowyield)]
df_temp_high = df_postitive_yield[df_postitive_yield.Occupation.isin(occupation_highyield)]

sns.boxplot(x=df_temp_low.Occupation, y=df_temp_low.EstimatedEffectiveYield, 
            color='steelblue', order=occupation_lowyield, ax=ax[0])
ax[0].set_xticklabels(occupation_lowyield, rotation=90)

sns.boxplot(x=df_temp_high.Occupation, y=df_temp_high.EstimatedEffectiveYield, 
            color='salmon', order=occupation_highyield, ax=ax[1])
ax[1].set_xticklabels(occupation_highyield, rotation=90);
In [61]:
fix, ax=plt.subplots(ncols=2, figsize=[16,5])

mask_low = df_temp_low.StatedMonthlyIncome<30000
mask_high = df_temp_high.StatedMonthlyIncome<30000

sns.boxplot(x=df_temp_low[mask_low].Occupation, y=df_temp_low[mask_low].StatedMonthlyIncome, 
            color='steelblue', order=occupation_lowyield, ax=ax[0])
ax[0].set_xticklabels(occupation_lowyield, rotation=90)

sns.boxplot(x=df_temp_high[mask_high].Occupation, y=df_temp_high[mask_high].StatedMonthlyIncome, 
            color='salmon', order=occupation_highyield, ax=ax[1])
ax[1].set_xticklabels(occupation_highyield, rotation=90);
In [62]:
sns.set(style='white')

fit, ax = plt.subplots(ncols=2, figsize=[16,5])

sns.scatterplot(x=df_postitive_yield.LoanOriginalAmount, y=df_postitive_yield.EstimatedEffectiveYield, 
                alpha = 0.05, color='steelblue', ax=ax[0])

OriginalLoanAmount_bins = pd.cut(df_postitive_yield.LoanOriginalAmount, [0,5100,10100,15100,
                                                                         20100,25100,30100,35100], include_lowest = True)

sns.boxplot(x=OriginalLoanAmount_bins, y=df_postitive_yield.EstimatedEffectiveYield, color='steelblue', ax=ax[1])
plt.xticks(rotation=45);
In [63]:
fig, ax = plt.subplots(ncols=2, figsize=[16,5])

loan_bins = [0,5100,10100,15100,20100,25100,30100,35100]
creditscore_bins = np.arange(601,901,25)

h = ax[0].hist2d(x=df_postitive_yield.LoanOriginalAmount, y=df_postitive_yield.CreditScoreMid, 
             bins=[loan_bins,creditscore_bins], cmap="Blues", cmin = 100)
plt.colorbar(h[3], ax=ax[0])

OriginalLoanAmount_bins = pd.cut(df_postitive_yield.LoanOriginalAmount, bins=loan_bins, include_lowest = True)

sns.boxplot(x=OriginalLoanAmount_bins, y=df_postitive_yield.CreditScoreMid, color='steelblue', ax=ax[1])
plt.xticks(rotation=45);
In [64]:
sns.set(style='whitegrid')

fig, ax = plt.subplots(ncols=2, figsize=[16,5])

sns.boxplot(x=df.LoanStatus, y=df.EstimatedLoss, color='steelblue', ax=ax[0])
sns.boxplot(x=df.LoanStatus, y=df.LoanLossRatio, color='steelblue', ax=ax[1])

for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=90);

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Your answer here!

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Your answer here!

Multivariate Exploration

Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

In [65]:
cred_bins = np.arange(601,901,50)

creditscore_bins = pd.cut(df.CreditScoreMid, cred_bins, include_lowest = True)
creditscore_index= creditscore_bins.value_counts().sort_index().index

fig, ax = plt.subplots(ncols=2, figsize=[16,5])
palette = sns.color_palette("Blues")

sns.pointplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedEffectiveYield, 
              hue = creditscore_bins, 
              ax = ax[0], palette=palette)
sns.pointplot(x = df.ListingCreationDate.dt.year, y = df.EstimatedLoss, 
              hue = creditscore_bins,
              ax = ax[1], palette=palette);
In [66]:
fig, ax = plt.subplots(ncols=2,figsize=[16,5])

palette = sns.color_palette("Blues")

del_bins = np.arange(0, 100 + 5, 5)
delinq7yr_bins = pd.cut(df.DelinquenciesLast7Years, del_bins, include_lowest = True)
sns.pointplot(x = delinq7yr_bins, y = df.EstimatedEffectiveYield, 
              hue = df.ListingCreationDate.dt.year, 
              ax=ax[0], palette=palette)
ax[0].set_xticklabels(labels=del_bins,rotation = 90)

inq_bins = np.arange(0, 20 + 2, 2)
inquiries_bins = pd.cut(df.InquiriesLast6Months, inq_bins, include_lowest = True)
sns.pointplot(x = inquiries_bins, y = df.EstimatedEffectiveYield, 
              hue = df.ListingCreationDate.dt.year, 
              ax=ax[1], palette=palette)
ax[1].set_xticklabels(labels=inq_bins,rotation = 90);
In [67]:
del_bins = np.arange(0, 100 + 5, 5)
inq_bins = np.arange(0, 20 + 2, 2)
cred_bins = np.arange(601,901,50)

creditscore_bins = pd.cut(df.CreditScoreMid, cred_bins, include_lowest = True)
creditscore_index= creditscore_bins.value_counts().sort_index().index

fig, ax = plt.subplots(nrows =len(creditscore_index), ncols=2,figsize=[16,30])

palette = sns.color_palette("Blues")

row = 0

for credit_score in creditscore_index:  
    
    df_credit = df[creditscore_bins == credit_score]
    
    delinq7yr_bins = pd.cut(df_credit.DelinquenciesLast7Years, del_bins, include_lowest = True)  
    sns.pointplot(x = delinq7yr_bins, y = df_credit.EstimatedEffectiveYield, hue = df_credit.ListingCreationDate.dt.year, 
                  ax=ax[row,0], palette=palette, ci=None)
    ax[row,0].set_xticklabels(labels=del_bins,rotation = 45)
    ax[row,0].title.set_text('Credit score ={}'.format(credit_score))
    
    inquiries_bins = pd.cut(df_credit.InquiriesLast6Months, inq_bins, include_lowest = True)
    sns.pointplot(x = inquiries_bins, y = df_credit.EstimatedEffectiveYield, hue = df_credit.ListingCreationDate.dt.year, 
                  ax=ax[row,1], palette=palette, ci=None)
    ax[row,1].set_xticklabels(labels=inq_bins,rotation = 45)
    ax[row,1].title.set_text('Credit score ={}'.format(credit_score))
    
    row +=1
    
plt.subplots_adjust(hspace=0.3);
In [68]:
fig, ax = plt.subplots(ncols=2,figsize=[16,5])

palette = sns.cubehelix_palette(8)

ABC_bins = [100,1000,10000,30000, 100000,1000000]
AvailableBankCredit_bins = pd.cut(df.AvailableBankcardCredit, ABC_bins, 
                                  include_lowest = True)
sns.pointplot(x = AvailableBankCredit_bins, y = df.EstimatedEffectiveYield, 
              hue = df.ListingCreationDate.dt.year, 
              ax=ax[0], palette=palette)
ax[0].set_xticklabels(labels=ABC_bins,rotation = 90)

DTI_bins = [0,0.2,0.3,0.4,0.5,0.75,1]
DebtToIncome_bins = pd.cut(df.DebtToIncomeRatio, DTI_bins, 
                                  include_lowest = True)
sns.pointplot(x = DebtToIncome_bins, y = df.EstimatedEffectiveYield, 
              hue = df.ListingCreationDate.dt.year, 
              ax=ax[1], palette=palette)
ax[1].set_xticklabels(labels=DTI_bins,rotation = 90);
In [69]:
ABC_bins = [100,1000,10000,30000, 100000,1000000]
DTI_bins = [0,0.2,0.3,0.4,0.5,0.75,1]
cred_bins = np.arange(601,901,50)

creditscore_bins = pd.cut(df.CreditScoreMid, cred_bins, include_lowest = True)
creditscore_index= creditscore_bins.value_counts().sort_index().index

fig, ax = plt.subplots(nrows =len(creditscore_index), ncols=2,figsize=[16,30])

palette = sns.cubehelix_palette(8)

row = 0

for credit_score in creditscore_index:  
    
    df_credit = df[(creditscore_bins == credit_score)]
    AvailableBankCredit_bins = pd.cut(df_credit.AvailableBankcardCredit, ABC_bins, include_lowest = True)
    sns.pointplot(x = AvailableBankCredit_bins, y = df_credit.EstimatedEffectiveYield, 
                  hue = df_credit.ListingCreationDate.dt.year, 
                  ax=ax[row,0], palette=palette, ci=False)
    ax[row,0].set_xticklabels(labels=ABC_bins,rotation = 45)
    ax[row,0].title.set_text('Credit score ={}'.format(credit_score))
    
    DebtToIncome_bins = pd.cut(df_credit.DebtToIncomeRatio, DTI_bins, include_lowest = True)
    sns.pointplot(x = DebtToIncome_bins, y = df_credit.EstimatedEffectiveYield, 
                  hue = df_credit.ListingCreationDate.dt.year, 
                  ax=ax[row,1], palette=palette, ci=False)
    ax[row,1].set_xticklabels(labels=DTI_bins,rotation = 45)
    ax[row,1].title.set_text('Credit score ={}'.format(credit_score))
    
    row +=1
    
plt.subplots_adjust(hspace=0.3);
In [70]:
sns.set(style='white')

years=df_postitive_yield.LoanOriginationDate.dt.year.value_counts().sort_index().index
loan_bins = [0,5100,10100,15100,20100,25100,35100]
creditscore_bins = np.arange(601,901,25)
creditscore_largebins = [600.001,701,901]

OriginalLoanAmount_bins = pd.cut(df_postitive_yield.LoanOriginalAmount, bins=loan_bins, include_lowest = True)
OriginalLoanAmount_index=OriginalLoanAmount_bins.value_counts().sort_index().index

rows = len(OriginalLoanAmount_index)


fig, ax = plt.subplots(nrows = rows, ncols=2, figsize=[16, rows*5])

row = 0
for loan_amount in OriginalLoanAmount_index:
    
    mask = (OriginalLoanAmount_bins == loan_amount)
    
    h = ax[row,0].hist2d(x=df_postitive_yield[mask].LoanOriginationDate.dt.year, 
                         y=df_postitive_yield[mask].CreditScoreMid, 
                         bins=[years,creditscore_bins], cmap="Blues", cmin = 50)
    plt.colorbar(h[3], ax=ax[row,0])
    ax[row,0].title.set_text('Loan amount range ={}'.format(loan_amount))

    CreditScore_large_bins = pd.cut(df_postitive_yield[mask].CreditScoreMid, bins=creditscore_largebins, 
                                    include_lowest = True)
    sns.boxplot(x=df_postitive_yield[mask].LoanOriginationDate.dt.year, 
                y=df_postitive_yield[mask].EstimatedEffectiveYield, 
                hue=CreditScore_large_bins,
                palette='Blues', ax=ax[row,1])
    ax[row,1].title.set_text('Loan amount range ={}'.format(loan_amount))
    
    row +=1
    
plt.subplots_adjust(hspace=0.3);

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Your answer here!

Were there any interesting or surprising interactions between features?

Your answer here!

At the end of your report, make sure that you export the notebook as an html file from the File > Download as... > HTML menu. Make sure you keep track of where the exported file goes, so you can put it in the same folder as this notebook for project submission. Also, make sure you remove all of the quote-formatted guide notes like this one before you finish your report!

In [ ]: